Predicting customer churn using complex statistical modeling
Author
Patrick Lefler
Published
February 4, 2026
Strategic Attrition Analytics for Consumer Credit
This project provides a comprehensive data science framework for identifying, analyzing, and predicting customer attrition within a consumer credit card division. By leveraging a historical dataset of over 10,000 records, this analysis moves beyond descriptive reporting to deliver actionable risk intelligence and tactical insights.
Data Source & Composition
The underlying data is sourced from the Kaggle Credit Card Customers dataset. It contains anonymized profiles of both current and former clients, blending two distinct data categories:
Demographics: Detailed attributes including age, gender, marital status, income category, and education level.
Account Behavior: Performance metrics such as credit limits, revolving balances, transaction frequency, and bank-initiated communication logs.
Within this population, customer churn (represented by former bank customers) accounts for approximately 16% of the total dataset, providing a robust sample for predictive modeling and behavioral analysis.
Project Objectives
The primary goal of this analysis is to transform raw data into a proactive retention strategy through four key methodologies:
Identify Drivers: Utilizing Logistic Regression to isolate specific “Risk Multipliers”—the behavioral factors that significantly increase the likelihood of account closure.
Predict Risk: Deploying a Random Forest machine learning model to assign an individualized churn probability score to every existing customer.
Analyze Lifecycle: Implementing Survival Analysis to map “Customer Life Expectancy,” allowing the bank to identify critical tenure milestones where the risk of departure is highest.
Tactical Action: Generating a prioritized outreach list of at-risk, active customers, enabling the retention team to focus resources where they will have the highest impact.
To determine why customers leave, this analysis examines behavioral signatures: patterns in how individuals use their cards before closing an account. Exploratory Data Analysis (EDA) is used to compare the habits of over 10,000 customers. Charting transaction counts against transaction amounts makes it easier to spot current customers whose usage resembles that of customers who have already left.
Statistical analysis shows that churned customers aren’t necessarily those with the lowest credit limits; rather, they are the ones who have stopped integrating the card into their daily routine. Identifying this drop in transaction velocity might allow the bank to intervene weeks or months before a customer formally requests to cancel, transforming the strategy from reactive damage control to proactive relationship management.
Interactive comparison of transaction counts and total spending.
Code
#| label: visual-analysis
#| fig-cap: "Interactive comparison of transaction counts and total spending."

library(ggplot2)  # ggplot(), geom_point(), labs(), ...
library(plotly)   # ggplotly()

# Interactive scatter plot using the exact column names in attritionData
p <- ggplot(attritionData, aes(x = total.transaction.count, y = total.transaction.amount, color = churn)) +
  geom_point(alpha = 0.3) +
  scale_color_manual(values = c("Existing Customer" = "#3e3f3a", "Attrited Customer" = "#df691a")) +
  labs(title = "Transaction Velocity: Usage vs. Attrition Status",
       x = "Total Transaction Count (Annual)",
       y = "Total Transaction Amount ($)",
       color = "Status") +
  theme_minimal()

# Convert the static ggplot into an interactive plotly widget
ggplotly(p)
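As a rough numeric companion to the scatter plot, the usage gap between the two groups can also be summarized per status. The sketch below uses a tiny made-up data frame (the values are invented for illustration and are not drawn from the Kaggle data) that borrows the column names used with `attritionData`. On the real data, the same `aggregate()` call should apply, assuming those columns are numeric.

```r
# Toy stand-in for attritionData: values are invented purely for illustration
toy <- data.frame(
  churn = c("Existing Customer", "Existing Customer", "Existing Customer",
            "Attrited Customer", "Attrited Customer"),
  total.transaction.count  = c(80, 65, 72, 40, 35),
  total.transaction.amount = c(4500, 3900, 4200, 2100, 1800)
)

# Mean transaction count and amount for each churn status
usage_by_status <- aggregate(
  cbind(total.transaction.count, total.transaction.amount) ~ churn,
  data = toy,
  FUN  = mean
)

print(usage_by_status)
```

A large gap between the two rows of this summary would be the tabular counterpart of the visual separation in the scatter plot.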
Note: Segmenting Attrition Across Key Factors
While the Transaction Velocity plot provides a high-level view of account usage, attrition risk can be distributed unevenly across demographic segments. This section uses a series of comparative plots to show how attrition rates vary across age, gender, marital status, income, and education. In this dataset, however, no segment stands out with an attrition rate different enough to explain why customers stay or leave.
# plotIncome + plotEducation + plot_annotation(title = 'Customer Attrition Segmentation Across Income & Education')
Statistical Drivers
Note: Identifying the “Why” with Risk Multipliers
To move beyond simple charts, logistic regression is used to calculate the Risk Multiplier (mathematically, the exponentiated coefficient, also known as the odds ratio). The multiplier is obtained by exponentiating each raw model coefficient, which converts it from the log-odds scale into a multiplicative effect on the “Odds of Churn.”
How to Interpret the Numbers:
Multiplier > 1 (Risk Driver): This indicates that as this factor increases, the risk of attrition increases. A multiplier of 1.20 means that for every one-unit increase in that factor, the odds of a customer leaving grow by 20%, holding the other factors constant.
Multiplier = 1 (Neutral): This means the factor has no measurable association with the odds of attrition. It does not help predict whether a customer will stay or leave.
Multiplier < 1 (Protective Factor): This indicates that as this factor increases, the risk of attrition decreases. A multiplier of 0.80 means that for every one-unit increase, the odds of churn are reduced by 20%.
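Because a multiplier acts on odds rather than directly on probability, a short worked example may help. The sketch below is illustrative only: the helper names (`prob_to_odds`, `odds_to_prob`, `apply_multiplier`) are hypothetical, the coefficient is made up, and the 16% baseline simply borrows the overall churn share quoted earlier.

```r
# Convert between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

# Apply a risk multiplier (odds ratio) to a baseline probability.
# `units` is the size of the increase in the underlying factor.
apply_multiplier <- function(p, multiplier, units = 1) {
  odds_to_prob(prob_to_odds(p) * multiplier^units)
}

# A hypothetical raw logistic-regression coefficient on the log-odds scale
raw_coefficient <- log(1.20)
risk_multiplier <- exp(raw_coefficient)   # exponentiating gives 1.20

baseline_probability <- 0.16
new_probability <- apply_multiplier(baseline_probability, risk_multiplier)

print(risk_multiplier)
print(new_probability)
```

With these inputs the odds rise by 20%, but the probability moves from 16% to roughly 18.6% rather than to 19.2%, which is why the multipliers are best read as changes in odds.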
In this updated view, all variables in the model are included to provide a complete picture of every factor recorded by the bank, from age to revolving balance.
The horizontal axis, labeled “Importance Score”, represents the predictive contribution of each factor. In this analysis, a calculation called Gini Impurity is used to determine these scores. Think of impurity as the amount of uncertainty or “clutter” in the data. Every time the model uses a variable like Transaction Count to successfully sort customers into “Stay” or “Leave” buckets, it reduces that clutter.
The scale (ranging from 0 to 250) is a calculated aggregate score, not a direct count of customers. It represents the total amount of clarity gained across all of the decision trees in the model. A larger dataset allows for more splits, which can produce higher totals, so the absolute value is less important than the relative distance between the bars. For example, if one factor has a score of 180 and another has 60, the first contributed about three times as much impurity reduction, and so did roughly three times as much of the work of separating accounts that stay from accounts that leave. The ranking also serves as a sanity check that the model is leaning on the same high-impact behaviors, such as usage velocity and revolving balances, that industry experience suggests are the true drivers of risk.
Somewhat perplexingly, age overall does not appear to be a key indicator of future customer attrition in the random forest results below, yet certain age brackets (ages > 50) rank extremely high on the risk multiplier plot above. More investigative work is needed to explain the apparent discrepancy.
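To make the idea of "reducing clutter" concrete, the sketch below computes the Gini impurity of a single node before and after one split. It is a simplified illustration with invented labels and a hypothetical `gini_impurity` helper; the importance scores reported by the model aggregate many such decreases across trees, and the library's internal scaling may differ.

```r
# Gini impurity of a set of class labels: 1 minus the sum of squared class proportions
gini_impurity <- function(labels) {
  proportions <- table(labels) / length(labels)
  1 - sum(proportions^2)
}

# Hypothetical node of 10 customers, split on an invented usage threshold
parent <- c(rep("Leave", 5), rep("Stay", 5))
left   <- c(rep("Leave", 4), "Stay")   # e.g. customers with a low transaction count
right  <- c("Leave", rep("Stay", 4))   # e.g. customers with a high transaction count

# Impurity of the children, weighted by how many customers land in each
weighted_children <- (length(left)  * gini_impurity(left) +
                      length(right) * gini_impurity(right)) / length(parent)

# The drop in impurity is what gets credited to the splitting variable
impurity_decrease <- gini_impurity(parent) - weighted_children

print(gini_impurity(parent))
print(impurity_decrease)
```

Here the parent node is as mixed as possible (impurity 0.5), and the split lowers the weighted impurity to 0.32, so the splitting variable is credited with a decrease of 0.18.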
Code
library(tidymodels)  # rand_forest(), set_engine(), set_mode(), fit(), plus dplyr and ggplot2
library(vip)         # vi()
library(stringr)     # str_to_title(), str_replace_all()

# Build Random Forest model - explicitly excluding target variables to prevent data leakage
rf_model <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(churn ~ . - client.id - attrition.flag, data = attritionData)

# 1. Importance Plot: Visualizing the predictive 'heavy lifters' using native ggplot
importance_data <- vi(rf_model) %>%
  mutate(Variable = str_to_title(str_replace_all(Variable, "\\.", " ")))

ggplotData <- importance_data %>%
  filter(Variable != "Attrition Flag") %>%  ## Exclude "Attrition Flag" data from plot
  filter(Variable != "Client Id")           ## Exclude "Client Id" data from plot

importancePlot <- ggplot(ggplotData, aes(x = Importance, y = reorder(Variable, Importance))) +
  geom_col(fill = "#df691a", alpha = 0.8) +
  labs(title = "Predictive Ranking: Key Indicators of Customer Attrition",
       subtitle = "Calculated relative contribution to model accuracy",
       x = "Importance Score (Weighted Information Gain)",
       y = "Customer Attribute") +
  xlim(0, 250) +
  theme_minimal()

importancePlot
High-Risk Attrition List Customers
Note: Identifying customers at risk for attrition
After scoring, all current customers were ranked from high to low by their calculated probability of departure. Even the highest-ranking at-risk customer has a calculated probability of leaving of only about 10%. The next step would be to drill down on these at-risk customers to refine the actual probability of departure. More data is needed to improve confidence in the model. As with most machine learning and logistic regression analysis, the quality of the outcome is only as good as the quality of the data.
Tactical Outreach List: Top 50 Highest At-Risk ACTIVE Accounts
| ClientID | Gender | Age Bracket | Income | Card Tier | Outstanding Balance | Recent Transaction Amount | Recent Contacts | Churn Probability |
|---|---|---|---|---|---|---|---|---|
| 785432733 | Female | 40-49 | <$40k | Gold | $0 | $966 | 3 | 10.23% |
| 721425558 | Male | 50-59 | >$120k | Blue | $0 | $1,536 | 2 | 9.92% |
| 709465758 | Female | 60+ | <$40k | Blue | $0 | $902 | 3 | 9.27% |
| 719621958 | Male | 40-49 | $60k-$80k | Blue | $0 | $1,720 | 3 | 8.32% |
| 719038008 | Female | 40-49 | <$40k | Blue | $1,192 | $4,862 | 2 | 8.11% |
| 712215258 | Female | 50-59 | $40k-$60k | Silver | $0 | $1,298 | 3 | 7.47% |
| 754897008 | Male | 40-49 | $40k-$60k | Blue | $1,418 | $1,319 | 2 | 7.27% |
| 713497983 | Male | 40-49 | $60k-$80k | Blue | $0 | $3,459 | 2 | 7.26% |
| 827111283 | Male | 40-49 | $80k-$120k | Blue | $578 | $1,109 | 2 | 7.09% |
| 805259733 | Female | 50-59 | Unknown | Blue | $0 | $1,731 | 4 | 7.03% |
| 713146683 | Female | 30-39 | Unknown | Blue | $0 | $5,473 | 2 | 6.95% |
| 719363283 | Female | 50-59 | <$40k | Blue | $0 | $1,904 | 2 | 6.84% |
| 711757383 | Female | 50-59 | <$40k | Blue | $0 | $1,905 | 2 | 6.82% |
| 708664008 | Male | 50-59 | $80k-$120k | Blue | $0 | $1,222 | 4 | 6.82% |
| 708655983 | Female | 40-49 | Unknown | Blue | $0 | $1,353 | 2 | 6.21% |
| 718086783 | Male | 50-59 | >$120k | Blue | $0 | $4,738 | 1 | 6.21% |
| 717975333 | Male | 50-59 | $80k-$120k | Blue | $1,330 | $837 | 2 | 6.15% |
| 779743908 | Male | 40-49 | $60k-$80k | Blue | $0 | $1,196 | 2 | 6.09% |
| 820075983 | Male | <30 | <$40k | Blue | $1,535 | $2,299 | 2 | 6.03% |
| 823629333 | Male | 40-49 | $40k-$60k | Blue | $0 | $4,220 | 1 | 6.03% |
| 787467858 | Male | 40-49 | $80k-$120k | Silver | $2,045 | $4,081 | 3 | 6.00% |
| 710662158 | Male | 40-49 | >$120k | Blue | $2,517 | $2,051 | 3 | 5.96% |
| 709531908 | Male | 50-59 | $60k-$80k | Blue | $0 | $2,184 | 2 | 5.84% |
| 788965683 | Female | 40-49 | <$40k | Blue | $0 | $2,170 | 3 | 5.83% |
| 718934058 | Male | 30-39 | $40k-$60k | Blue | $2,517 | $2,396 | 2 | 5.82% |
| 771075258 | Male | 50-59 | >$120k | Silver | $1,527 | $1,268 | 2 | 5.80% |
| 820288233 | Female | <30 | <$40k | Blue | $0 | $2,731 | 4 | 5.76% |
| 711028308 | Female | 40-49 | <$40k | Blue | $0 | $1,468 | 0 | 5.74% |
| 708741633 | Male | 50-59 | $60k-$80k | Blue | $0 | $1,771 | 4 | 5.74% |
| 709106358 | Male | 40-49 | $60k-$80k | Blue | $0 | $816 | 0 | 5.55% |
| 710044308 | Female | 40-49 | $40k-$60k | Blue | $1,594 | $2,480 | 2 | 5.53% |
| 713217858 | Female | 30-39 | <$40k | Blue | $0 | $1,204 | 3 | 5.50% |
| 789270033 | Male | 40-49 | $80k-$120k | Silver | $0 | $1,122 | 3 | 5.49% |
| 775112958 | Male | 50-59 | $60k-$80k | Blue | $0 | $7,781 | 3 | 5.48% |
| 718435158 | Male | 50-59 | >$120k | Blue | $0 | $1,930 | 4 | 5.45% |
| 714547458 | Male | 60+ | $40k-$60k | Blue | $0 | $1,941 | 2 | 5.35% |
| 797234508 | Male | 50-59 | $60k-$80k | Silver | $0 | $1,481 | 4 | 5.34% |
| 784798683 | Male | 30-39 | $40k-$60k | Blue | $0 | $1,999 | 2 | 5.33% |
| 824782008 | Male | 40-49 | $60k-$80k | Blue | $510 | $1,228 | 2 | 5.32% |
| 718361583 | Female | 50-59 | Unknown | Blue | $0 | $1,559 | 5 | 5.29% |
| 710632683 | Male | 30-39 | $40k-$60k | Blue | $0 | $2,308 | 3 | 5.27% |
| 714495258 | Male | <30 | <$40k | Blue | $479 | $1,786 | 3 | 5.22% |
| 714878508 | Female | 40-49 | <$40k | Blue | $535 | $2,051 | 3 | 5.11% |
| 778855383 | Female | 30-39 | <$40k | Blue | $0 | $1,975 | 3 | 5.08% |
| 715279683 | Female | 50-59 | $40k-$60k | Blue | $0 | $1,679 | 4 | 5.06% |
| 718627458 | Female | 40-49 | Unknown | Blue | $0 | $2,119 | 2 | 5.03% |
| 721051908 | Male | 40-49 | $60k-$80k | Blue | $0 | $1,393 | 3 | 4.86% |
| 789562383 | Female | 30-39 | <$40k | Blue | $583 | $1,259 | 0 | 4.80% |
| 771073908 | Male | 60+ | $40k-$60k | Blue | $0 | $1,406 | 2 | 4.79% |
| 803776533 | Male | 40-49 | $60k-$80k | Blue | $0 | $2,118 | 2 | 4.79% |
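The ranking step behind an outreach list like the one above could be sketched as follows. This is not the author's code: the data frame, its values, the `churn.probability` column, and the `build_outreach_list` helper are all hypothetical. With a fitted tidymodels classifier, the probabilities themselves would typically come from `predict(..., type = "prob")`.

```r
# Hypothetical scored accounts: IDs, statuses, and probabilities are invented
scored <- data.frame(
  client.id         = c(101, 102, 103, 104, 105),
  attrition.flag    = c("Existing Customer", "Attrited Customer", "Existing Customer",
                        "Existing Customer", "Existing Customer"),
  churn.probability = c(0.031, 0.870, 0.102, 0.054, 0.009)
)

# Keep only active accounts, sort by predicted churn probability, take the top n
build_outreach_list <- function(scores, n = 50) {
  active <- scores[scores$attrition.flag == "Existing Customer", ]
  ranked <- active[order(active$churn.probability, decreasing = TRUE), ]
  head(ranked, n)
}

outreach <- build_outreach_list(scored, n = 3)
print(outreach)
```

Filtering to active accounts before ranking matters: customers who have already left would otherwise dominate the top of the list while offering nothing for the retention team to act on.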
The Attrition Timeline
Note: Visualizing Customer Life Expectancy
This final section uses survival analysis to map the customer lifecycle. Sudden drops in the curve indicate risk milestones: specific anniversary dates where customers are statistically most likely to reconsider their relationship with the bank. Here there appears to be a significant drop-off in customer retention at the three-year (36-month) point. This could be caused by new customers being offered three years of below-market financing or other inducements, although it is also worth checking whether an unusual concentration of accounts recorded at exactly 36 months on book contributes to the drop. It is certainly a good place to start further investigation.
Code
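#| label: survival-timeline

library(dplyr)      # mutate(), %>%
library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()

# Encode the event indicator: 1 = attrited (event observed), 0 = still active (censored)
surv_obj <- attritionData %>%
  mutate(status = ifelse(attrition.flag == "Attrited Customer", 1, 0))

# Fit a single survival curve over tenure for the whole population
fit_km <- survfit(Surv(months.on.book, status) ~ 1, data = surv_obj)

ggsurvplot(fit_km, data = surv_obj,
           palette = "#df691a",
           title = "Customer Retention Probability by Tenure",
           xlab = "Months on Book (Customer Lifecycle)",
           ylab = "Retention Probability",
           ggtheme = theme_minimal())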
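To show where the drops in such a retention curve come from, the sketch below computes a Kaplan-Meier style estimate by hand in base R. It is a minimal illustration on invented tenure data with a hypothetical `km_estimate` helper, not a replacement for `survfit()`.

```r
# Step-by-step Kaplan-Meier estimate.
# time:   months on book; status: 1 = attrited (event), 0 = still active (censored)
km_estimate <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  n_steps <- length(event_times)
  out <- data.frame(
    time     = event_times,
    at_risk  = integer(n_steps),
    events   = integer(n_steps),
    survival = numeric(n_steps)
  )
  surv <- 1
  for (i in seq_len(n_steps)) {
    t <- event_times[i]
    n_at_risk <- sum(time >= t)                 # accounts still observed at month t
    n_events  <- sum(time == t & status == 1)   # accounts that attrited at month t
    surv <- surv * (1 - n_events / n_at_risk)   # multiply in this month's retention rate
    out$at_risk[i]  <- n_at_risk
    out$events[i]   <- n_events
    out$survival[i] <- surv
  }
  out
}

# Invented example: eight accounts, several clustered at 36 months
tenure <- c(12, 24, 36, 36, 36, 40, 48, 48)
status <- c(0,  1,  1,  1,  0,  0,  1,  0)

km <- km_estimate(tenure, status)
print(km)
```

In this toy example the estimate falls from 6/7 to 4/7 at month 36 because two of the six accounts still at risk leave at that point. A steep drop therefore reflects many events among the accounts observed at that exact tenure.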
Key Findings
Analysis of more than 10,000 customer records suggests that attrition within the consumer credit card division is rarely a sudden event; rather, it is characterized by a gradual “behavioral drift.” In the Random Forest model, the primary predictors of churn are not demographic markers such as age or income but behavioral ones, including declines in transaction velocity and average utilization. When a customer’s total transaction count drops or they cease maintaining a revolving balance, they may be signaling an intent to depart months before the account is formally closed.
Furthermore, survival analysis identified a critical “tenure risk” at the 36-month milestone. This suggests that as customers reach their third anniversary with the bank, initial product appeals or promotional incentives often lose their efficacy. To mitigate this risk, a structural shift in the Customer Lifecycle Management process may be needed; specifically, the implementation of automated stay-active incentives and product reviews timed for this three-year window.
This project represents only a starting point in demonstrating how machine learning and logistic regression can solve complex challenges within the credit and risk markets. By transforming data into proactive intelligence, institutions can intervene earlier and preserve valuable customer relationships.